TIPS
Write commands in code editor of R Studio and run them using icon
Run in R Studio.
We need several R packages for our analysis today. Some are installed through bioconductor, some from CRAN through install.packages.
Bioconductor
If you have not already done so, install BiocManager for Bioconductor tools:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
Install Bioconductor packages:
BiocManager::install('multtest', update=F)
BiocManager::install("ComplexHeatmap", update=F)
BiocManager::install("circlize", update=F)
BiocManager::install("edgeR", update=F)
CRAN packages
Install Seurat and other CRAN packages:
install.packages('Seurat')
install.packages('Matrix')
install.packages('tidyr')
install.packages('dplyr')
install.packages('ggplot2')
Presto package for fast wilcox test (recommended)
Install devtools first if you have not done before:
install.packages('devtools')
Install presto from GitHub:
devtools::install_github('immunogenomics/presto')
Check installations
library(Seurat)
library(multtest)
library(circlize)
library(ComplexHeatmap)
library(dplyr)
library(tidyr)
library(edgeR)
library(ggplot2)
library(Matrix)
library(presto)
Open this link in a web browser: https://wd.cri.uic.edu/scrna/10X_example_report.html
Practice with inDrop data (full matrix) and 10X data (sparse matrix)
library(Seurat)
library(Matrix)
library(dplyr)
counts_indrop <- read.delim("https://wd.cri.uic.edu/scrna/inDrop_counts.txt",
row.names=1)
sparse_indrop <- Matrix(as.matrix(counts_indrop),sparse=T)
format(object.size(counts_indrop), units="auto")
[1] "46.8 Mb"
format(object.size(sparse_indrop), units="auto")
[1] "13.4 Mb"
seurat_indrop <- CreateSeuratObject(counts=sparse_indrop, project="indrop_sample_data")
seurat_indrop
An object of class Seurat
9435 features across 1275 samples within 1 assay
Active assay: RNA (9435 features, 0 variable features)
1 layer present: counts
Downloads folder, execute the following.sparse_10X <- Read10X("~/Downloads/10X_test/filtered_feature_bc_matrix/",
gene.column=1)
Downloads folder, execute the
following.sparse_10X <- Read10X("~/../Downloads/10X_test/filtered_feature_bc_matrix/",
gene.column=1)seurat_10X <- CreateSeuratObject(counts=sparse_10X, project="10X_sample_data")
seurat_10X
An object of class Seurat
31053 features across 1301 samples within 1 assay
Active assay: RNA (31053 features, 0 variable features)
1 layer present: counts
NOTE: skip this part, try at home later. Skip to Cell QC steps.
Now we will analyze a larger 10X data set. These are data from a published study, Martos SN, Campbell MR, Lozoya OA, Wang X et al. Single-cell analyses identify dysfunctional CD16+ CD8 T cells in smokers. Cell Rep Med 2020 Jul 21;1(4). PMID: 33163982.
Please download zip files with the data first:
Unzip these on your computer. Note the folder these are downloaded to. The exercises below assume they are in your ~/Downloads folder.
Ensure the Seurat package is loaded.
library(Seurat)
Read the 10X files
Downloads folder, execute the following.V035_F031 <- Read10X("~/Downloads/GSE138867/sub_V035_F031", gene.column=1)
V039_F093 <- Read10X("~/Downloads/GSE138867/sub_V039_F093", gene.column=1)
V040_F198 <- Read10X("~/Downloads/GSE138867/sub_V040_F198", gene.column=1)
V041_F043 <- Read10X("~/Downloads/GSE138867/sub_V041_F043", gene.column=1)
V043_F047 <- Read10X("~/Downloads/GSE138867/sub_V043_F047", gene.column=1)
V044_F037 <- Read10X("~/Downloads/GSE138867/sub_V044_F037", gene.column=1)
V045_F120 <- Read10X("~/Downloads/GSE138867/sub_V045_F120", gene.column=1)
V054_F209 <- Read10X("~/Downloads/GSE138867/sub_V054_F209", gene.column=1)
Downloads folder, execute the
following.V035_F031 <- Read10X("~/../Downloads/GSE138867/sub_V035_F031", gene.column=1)
V039_F093 <- Read10X("~/../Downloads/GSE138867/sub_V039_F093", gene.column=1)
V040_F198 <- Read10X("~/../Downloads/GSE138867/sub_V040_F198", gene.column=1)
V041_F043 <- Read10X("~/../Downloads/GSE138867/sub_V041_F043", gene.column=1)
V043_F047 <- Read10X("~/../Downloads/GSE138867/sub_V043_F047", gene.column=1)
V044_F037 <- Read10X("~/../Downloads/GSE138867/sub_V044_F037", gene.column=1)
V045_F120 <- Read10X("~/../Downloads/GSE138867/sub_V045_F120", gene.column=1)
V054_F209 <- Read10X("~/../Downloads/GSE138867/sub_V054_F209", gene.column=1)
Create the Seurat objects
scV035_F031 <- CreateSeuratObject(counts=V035_F031, project="V035_F031")
scV039_F093 <- CreateSeuratObject(counts=V039_F093, project="V039_F093")
scV040_F198 <- CreateSeuratObject(counts=V040_F198, project="V040_F198")
scV041_F043 <- CreateSeuratObject(counts=V041_F043, project="V041_F043")
scV043_F047 <- CreateSeuratObject(counts=V043_F047, project="V043_F047")
scV044_F037 <- CreateSeuratObject(counts=V044_F037, project="V044_F037")
scV045_F120 <- CreateSeuratObject(counts=V045_F120, project="V045_F120")
scV054_F209 <- CreateSeuratObject(counts=V054_F209, project="V054_F209")
Combine the Seurat objects together to make one master data set. The
add.cell.ids parameter specifies a prefix to add to the
cell IDs when combining the data. This avoids situations where different
cell barcodes match from different samples.
sc_data <- merge(scV035_F031,
y=c(scV039_F093,scV040_F198,scV041_F043,scV043_F047,
scV044_F037,scV045_F120,scV054_F209),
add.cell.ids=c("s1","s2","s3","s4","s5","s6","s7","s8"))
Read the prepared RDS file, which reflects all of the steps in the last sub-exercise.
Note that orig.ident is how we’re keeping track of which
file came from which sample. Add group metadata to the Seurat
object.
sc_data <- readRDS(url("https://wd.cri.uic.edu/scrna/GSE138867.rds"))
Join the counts layers across all samples.
sc_data <- JoinLayers(sc_data)
Count cells per sample.
head(sc_data@meta.data)
orig.ident nCount_RNA nFeature_RNA
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335
s1_AAAGTAGGTCCTCCAT-1 V035_F031 9112 2393
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259
s1_AAATGCCAGCGTTCCG-1 V035_F031 2716 88
table(sc_data$orig.ident)
V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037 V045_F120 V054_F209
1000 1000 1000 1000 1000 1000 1000 1000
# add in sample group metadata
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
sc_data@meta.data$group <- metadata[sc_data@meta.data$orig.ident,"Group"]
head(sc_data@meta.data)
orig.ident nCount_RNA nFeature_RNA group
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237 Smoker
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169 Smoker
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335 Smoker
s1_AAAGTAGGTCCTCCAT-1 V035_F031 9112 2393 Smoker
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259 Smoker
s1_AAATGCCAGCGTTCCG-1 V035_F031 2716 88 Smoker
Add in gene metadata (gene symbols).
genes_to_ids <- read.table("https://wd.cri.uic.edu/scrna/features.tsv",
row.names=1, col.names=c("","gene.symbol"))
head(genes_to_ids)
gene.symbol
ENSG00000243485 MIR1302-10
ENSG00000237613 FAM138A
ENSG00000186092 OR4F5
ENSG00000238009 RP11-34P13.7
ENSG00000239945 RP11-34P13.8
ENSG00000237683 AL627309.1
sc_data[["RNA"]] <- AddMetaData(sc_data[["RNA"]],
genes_to_ids[Features(sc_data),,drop=F])
Read in a list of the coordinates for each gene so, that we know which genes are on chrM
genes <- read.table("https://wd.cri.uic.edu/scrna/hg19_genes.bed")
dim(genes)
[1] 43319 6
# column 4 have the gene IDs, and column 1 the chromosome
head(genes)
V1 V2 V3 V4 V5 V6
1 chrX 99883667 99894988 ENSG00000000003 0 -
2 chrX 99839799 99854882 ENSG00000000005 0 +
3 chr20 49551404 49575092 ENSG00000000419 0 -
4 chr1 169818772 169863408 ENSG00000000457 0 -
5 chr1 169631245 169823221 ENSG00000000460 0 +
6 chr1 27938575 27961788 ENSG00000000938 0 -
Make a list of the mitochondrial genes: all gene IDs on the mitochondrial chromosome.
mt.genes <- as.character(genes[genes[,1]=="chrM", 4])
length(mt.genes)
[1] 13
Compute percent mitochondrial expression per cell
sc_data[["percent.MT"]] <- PercentageFeatureSet(sc_data, features=mt.genes)
NOTE: There are a few ways to get a list of the mitochondrial gene IDs for your genome.
https://genome.ucsc.edu/cgi-bin/hgTables, just make sure to
get the one that matches the gene IDs in your file.Violin plots of the distribution of counts, genes, and mitochondrial expression
VlnPlot(sc_data, features = c("nFeature_RNA","nCount_RNA","percent.MT"), pt.size=0.2)
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
Scatterplot of number of UMI counts vs % mitochondrial expression. Note that most cells with high levels of mitochrondrial expression have very low UMI counts.
FeatureScatter(sc_data, feature1="nCount_RNA", feature2="percent.MT")
Based on the violin plots and scatterplots, decide on a reasonable threshold for filtering out the “bad” cells.
For today, our passing cells must have:
sc_subset <- subset(sc_data,
nFeature_RNA>500 & nCount_RNA>2000 & percent.MT < 10)
sc_subset
An object of class Seurat
32738 features across 7392 samples within 1 assay
Active assay: RNA (32738 features, 0 variable features)
1 layer present: counts
NOTE: skip this part, try at home later. Skip to Doublet removal.
Run a doublet check with the SCDS library. We want to do this after the cell QC steps, but we should also analyze each sample separately.
library(scds)
library(SingleCellExperiment)
samples <- unique(sc_subset@meta.data$orig.ident)
doublets_df <- data.frame()
for(s in samples){
# subset to this sample
this_sc <- subset(sc_subset, orig.ident == s)
# get counts
this_counts <- GetAssayData(this_sc, layer="counts")
# set up SingleCellExperiment object
this_sce <- SingleCellExperiment(list(counts=this_counts))
# run hybrid method from scds
cxds <- cxds_bcds_hybrid(this_sce)
# make data frame of results
this_df <- data.frame("Barcode"=names(cxds$hybrid_score),
"Score" = cxds$hybrid_score,
"Sample" = s)
doublets_df <- rbind(doublets_df, this_df)
}
To save time, read in the results of the doublet analysis.
We will identify the top 5% of cells from each sample as doublets.
doublets_df <- read.delim("https://wd.cri.uic.edu/scrna/doublet_analysis.txt")
head(doublets_df)
Barcode Score Sample
1 s1_AAACGGGAGACTGTAA-1 0.2275562 V035_F031
2 s1_AAACGGGTCTCAAACG-1 0.2172869 V035_F031
3 s1_AAAGTAGGTAGCAAAT-1 0.5191589 V035_F031
4 s1_AAAGTAGGTCCTCCAT-1 1.4636403 V035_F031
5 s1_AAAGTAGTCCGCAAGC-1 0.2578670 V035_F031
6 s1_AAATGCCTCTTGAGGT-1 0.4076141 V035_F031
library(dplyr)
doublet_cells <- doublets_df %>%
group_by(Sample) %>%
slice_max(order_by=Score, prop=0.05)
# count how many cells we selected per sample
table(doublet_cells$Sample)
V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037 V045_F120 V054_F209
47 46 48 47 46 43 45 43
# add a doublet column to the metadata and filter
sc_subset@meta.data$IsDoublet <- rownames(sc_subset@meta.data) %in% doublet_cells$Barcode
head(sc_subset@meta.data)
orig.ident nCount_RNA nFeature_RNA group percent.MT
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237 Smoker 3.653144
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169 Smoker 3.223331
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335 Smoker 1.929964
s1_AAAGTAGGTCCTCCAT-1 V035_F031 9112 2393 Smoker 3.435031
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259 Smoker 4.061659
s1_AAATGCCTCTTGAGGT-1 V035_F031 5521 1292 Smoker 1.829379
IsDoublet
s1_AAACGGGAGACTGTAA-1 FALSE
s1_AAACGGGTCTCAAACG-1 FALSE
s1_AAAGTAGGTAGCAAAT-1 FALSE
s1_AAAGTAGGTCCTCCAT-1 TRUE
s1_AAAGTAGTCCGCAAGC-1 FALSE
s1_AAATGCCTCTTGAGGT-1 FALSE
sc_subset <- subset(sc_subset, IsDoublet == F)
Check what fraction of cells we kept from each sample, using the
table() function
# counts before filtering
orig.counts <- table(sc_data$orig.ident)
# counts after filtering
subset.counts <- table(sc_subset$orig.ident)
# combine and compare counts
cell_stats <- cbind("Starting Cells" = orig.counts,
"Retained Cells" = subset.counts,
"Fraction" = subset.counts/orig.counts)
cell_stats
Starting Cells Retained Cells Fraction
V035_F031 1000 894 0.894
V039_F093 1000 891 0.891
V040_F198 1000 923 0.923
V041_F043 1000 902 0.902
V043_F047 1000 879 0.879
V044_F037 1000 828 0.828
V045_F120 1000 874 0.874
V054_F209 1000 836 0.836
If you need to catch up, read in the RDS file that is consistent with the QC steps above:
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/postQC.rds"))
sc_subset <- NormalizeData(sc_subset)
Normalizing layer: counts
sc_subset <- FindVariableFeatures(sc_subset, nfeatures=4000)
Finding variable features for layer counts
VariableFeaturePlot(sc_subset)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
ScaleData function will only scale the features that are
identified as variable via the FindVariableFeatures()
function.sc_subset <- ScaleData(sc_subset)
Centering and scaling data matrix
npcs = 50 is the default. You may want to
increase for more complex data sets. Also set “verbose=F”
otherwise Seurat will print out a number of statistics to the
console.sc_subset <- RunPCA(sc_subset, npcs = 50, verbose=F)
ElbowPlot(sc_subset, ndims=50)
balanced=T). We want to examine each PC’s heatmap to see
at what point the “signal” starts to go away.NOTE: If you get warnings about figure margins too large, try making the plot pane in your R studio bigger
DimHeatmap(sc_subset, dims=1:50, cells=300, balanced=T)
DO NOT RUN THESE STEPS TODAY as they are computationally demanding. We will review the commands and results in the exercise handout.
JackStraw is a method for computing p-values for principal components. As implemented in Seurat, the steps are:
JackStraw() computes projected PCA
scores for a random permutation of a subset of the data, and compares
these values to the observed PCA scores for the genes. This allows the
computation of a p-value for the association of each gene with each
PC.ScoreJackStraw() computes p-values for
the PCs based on the distribution of p-values obtained for all genes
against that PC from the JackStraw step.JackStrawPlot() makes a diagnostic
plot we can use to evaluate the signal level and significance of each
PC.NOTE: The selection of the number of PCs to use (dims parameter) is different for the different functions. For JackStraw() you just give a number (e.g., 50), and for the other two you give a range (e.g., 1:50).
sc_subset <- JackStraw(sc_subset, num.replicate = 100, dims = 50)
sc_subset <- ScoreJackStraw(sc_subset, dims = 1:50)
JackStrawPlot(sc_subset, dims = 1:50)
These are Q-Q plots: PC curves that are more diverged from the black dashed diagonal line have a higher level of signal. P-values are given in the legend.
FindNeighbors(). For this exercise we will
use the first 21 PCs to determine the distance between the cells and
thus who are the “neighbors”pca.dims <- 1:21
sc_subset <- FindNeighbors(sc_subset, dims=pca.dims)
Computing nearest neighbor graph
Computing SNN
sc_subset <- FindClusters(sc_subset, resolution=0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 7027
Number of edges: 255224
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8817
Number of communities: 12
Elapsed time: 0 seconds
# see where clusters are stored
head(sc_subset@meta.data)
orig.ident nCount_RNA nFeature_RNA group percent.MT
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237 Smoker 3.653144
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169 Smoker 3.223331
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335 Smoker 1.929964
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259 Smoker 4.061659
s1_AAATGCCTCTTGAGGT-1 V035_F031 5521 1292 Smoker 1.829379
s1_AACACGTCAACTGCTA-1 V035_F031 3614 804 Smoker 2.102933
IsDoublet RNA_snn_res.0.5 seurat_clusters
s1_AAACGGGAGACTGTAA-1 FALSE 10 10
s1_AAACGGGTCTCAAACG-1 FALSE 2 2
s1_AAAGTAGGTAGCAAAT-1 FALSE 0 0
s1_AAAGTAGTCCGCAAGC-1 FALSE 3 3
s1_AAATGCCTCTTGAGGT-1 FALSE 3 3
s1_AACACGTCAACTGCTA-1 FALSE 0 0
sc_subset <- RunUMAP(sc_subset, dims=pca.dims)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
13:55:16 UMAP embedding parameters a = 0.9922 b = 1.112
13:55:16 Read 7027 rows and found 21 numeric columns
13:55:16 Using Annoy for neighbor search, n_neighbors = 30
13:55:16 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
13:55:16 Writing NN index file to temp file /tmp/Rtmpz7m5F6/file2a7a1d51e1f1d8
13:55:16 Searching Annoy index using 1 thread, search_k = 3000
13:55:18 Annoy recall = 100%
13:55:18 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
13:55:19 Initializing from normalized Laplacian + noise (using RSpectra)
13:55:19 Commencing optimization for 500 epochs, with 289920 positive edges
13:55:26 Optimization finished
# see where UMAP coordinates are stored
head(sc_subset@reductions$umap@cell.embeddings)
umap_1 umap_2
s1_AAACGGGAGACTGTAA-1 -11.9474511 6.451997
s1_AAACGGGTCTCAAACG-1 2.0845266 -3.441252
s1_AAAGTAGGTAGCAAAT-1 0.1690202 2.425961
s1_AAAGTAGTCCGCAAGC-1 0.5361578 -2.287897
s1_AAATGCCTCTTGAGGT-1 3.4209168 -1.337700
s1_AACACGTCAACTGCTA-1 4.7771568 1.230965
DimPlot(sc_subset, reduction='umap', label=T)
Save our work (Recommended!)
saveRDS(sc_subset,"clustered.rds")
If you closed R, then reload the sc_subset Seurat
object. Otherwise skip this step.
library(Seurat)
sc_subset <- readRDS("clustered.rds")
If you closed R and forgot to save your Seurat object, you can
read it from our server. Otherwise skip this step. NOTE
Use the url() function when reading an RDS from the
web.
library(Seurat)
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))
Look at cell cycle first. The Seurat library also loads genes associated with cell cycle. We will use those, though you may want to consider reviewing the list when you run an analysis.
We will use one of the features.tsv files from CellRanger to map
Ensembl IDs to gene symbols. If you added the gene symbols to the Seurat
object’s [[RNA]] meta.data earlier you can also pull this
information from there.
# load gene symbols for cell cycle
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
head(s.genes)
[1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4"
head(g2m.genes)
[1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2"
Get ID to gene symbol conversion table.
genes_to_ids <- data.frame(Features(sc_subset),
sc_subset[["RNA"]]@meta.data["gene.symbol"])
## or ##
genes_to_ids <- read.table("https://wd.cri.uic.edu/scrna/features.tsv")
Then map gene symbols to Ensembl IDs.
s.genes_ids <- merge(s.genes, genes_to_ids, by.x=1, by.y=2)
g2m.genes_ids <- merge(g2m.genes, genes_to_ids, by.x=1, by.y=2)
head(s.genes_ids)
x V1
1 ATAD2 ENSG00000156802
2 BLM ENSG00000197299
3 BRIP1 ENSG00000136492
4 CCNE2 ENSG00000175305
5 CDC45 ENSG00000093009
6 CDC6 ENSG00000094804
# cell cycle scoring
sc_subset <- CellCycleScoring(sc_subset,
s.features=s.genes_ids[,2], g2m.features=g2m.genes_ids[,2])
head(sc_subset@meta.data)
orig.ident nCount_RNA nFeature_RNA group percent.MT
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237 Smoker 3.653144
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169 Smoker 3.223331
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335 Smoker 1.929964
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259 Smoker 4.061659
s1_AAATGCCTCTTGAGGT-1 V035_F031 5521 1292 Smoker 1.829379
s1_AACACGTCAACTGCTA-1 V035_F031 3614 804 Smoker 2.102933
IsDoublet RNA_snn_res.0.5 seurat_clusters S.Score
s1_AAACGGGAGACTGTAA-1 FALSE 10 10 -0.04733959
s1_AAACGGGTCTCAAACG-1 FALSE 2 2 -0.06362626
s1_AAAGTAGGTAGCAAAT-1 FALSE 0 0 0.01897592
s1_AAAGTAGTCCGCAAGC-1 FALSE 3 3 -0.02388267
s1_AAATGCCTCTTGAGGT-1 FALSE 3 3 0.02537982
s1_AACACGTCAACTGCTA-1 FALSE 0 0 -0.03449400
G2M.Score Phase
s1_AAACGGGAGACTGTAA-1 -0.02549001 G1
s1_AAACGGGTCTCAAACG-1 -0.06068320 G1
s1_AAAGTAGGTAGCAAAT-1 -0.05418265 S
s1_AAAGTAGTCCGCAAGC-1 -0.04009162 G1
s1_AAATGCCTCTTGAGGT-1 -0.02018164 S
s1_AACACGTCAACTGCTA-1 -0.03611933 G1
# visualize cell cycle in UMAP
DimPlot(sc_subset, group.by="Phase")
# visualize scores versus clusters in boxplot
library(ggplot2)
ggplot(sc_subset@meta.data, aes(x=RNA_snn_res.0.5, y=S.Score)) +
geom_boxplot()
ggplot(sc_subset@meta.data, aes(x=RNA_snn_res.0.5, y=G2M.Score)) +
geom_boxplot()
In this case, it does not look like we had cell cycle bias.
If we did, and we decided it was important to remove it, we would go
back to the ScaleData step to correct for it (don’t
run this now):
sc_subset2 <- ScaleData(sc_subset, vars.to.regress = c("S.Score", "G2M.Score"))
Then rerun PCA and clustering steps, and re-evaluate after clustering the data again.
Evaluate batch effect by looking at each sample on the UMAP plot.
# UMAP plot by cluster and sample
DimPlot(sc_subset, group.by="seurat_clusters",split.by="orig.ident")
# Number of cells per sample per cluster
table(sc_subset@meta.data[,c("orig.ident","seurat_clusters")])
seurat_clusters
orig.ident 0 1 2 3 4 5 6 7 8 9 10 11
V035_F031 323 185 83 124 16 12 23 31 19 43 35 0
V039_F093 222 172 65 13 1 1 170 129 58 13 40 7
V040_F198 17 21 219 38 5 482 16 24 24 27 44 6
V041_F043 34 6 65 163 339 5 108 50 65 57 9 1
V043_F047 288 190 111 77 32 2 38 29 61 26 23 2
V044_F037 7 8 172 238 182 4 41 67 52 17 30 10
V045_F120 313 80 114 93 4 5 106 63 59 8 26 3
V054_F209 55 424 112 44 1 5 8 28 76 56 21 6
There are some sample-specific differences, but the samples are generally occupying the same spaces in the UMAP plot, so this is probably not an issue.
If we did want to correct for batch effect, these are the steps (don’t run this now):
# split object based on the orig.ident, or
# whatever batch factor is appropriate
sc_split <- SplitObject(sc_subset, split.by="orig.ident")
# run normalization and variant feature selection separately for each sample
sc_split <- lapply(sc_split, function(x){
x <- NormalizeData(x)
x <- FindVariableFeatures(x)})
# find integration features
integration_genes <- SelectIntegrationFeatures(sc_split)
# run integration
integration_anchors <- FindIntegrationAnchors(sc_split, anchor.features=integration_genes)
sc_integrated <- IntegrateData(anchorset = integration_anchors)
# this creates a NEW assay, "integrated"
# use this instead of the "RNA" assay
DefaultAssay(sc_integrated) <- "integrated"
At this point you would rerun the analysis from the
ScaleData step, and re-evalute the samples on the new
clustering results and UMAP plot. Note though that the
integrated assay layer ONLY has the variable genes that
were identified before; it does not have the full transcriptome.
Refer to take-home exercise 3.1 for examples in generating different clustering results and comparing those for similarities and differences.
We will try two strategies:
Read in the clustering results if necessary:
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))
We have prepared a small list of marker genes and their cell types for common immune cell types. In practice, you would want to derive your own cell marker list based on the types of cells you anticipate finding in your data set, the level of specificity you’re looking for in defining cell types (e.g., generic T cell, vs T helper, T reg, T responder, etc.), and what markers you consider to be specific and informative for those cell types.
markers <- read.delim("https://wd.cri.uic.edu/scrna/blood_markers.txt")
markers
Marker Cell.Type
1 CD79A B cells
2 MS4A1 B cells
3 BCL11A B cells
4 IFITM3 Monocytes
5 PSAP Monocytes
6 CST3 Monocytes
7 LYZ Monocytes
8 GNLY NK cells
9 FCGR3A NK cells
10 CD3D T cells
11 CD3E T cells
12 CD3G T cells
13 SELL T cells
Get ID to gene symbol conversion table.
genenames <- data.frame(Features(sc_subset),
sc_subset[["RNA"]]@meta.data["gene.symbol"])
## or
genenames <- read.table("https://wd.cri.uic.edu/scrna/features.tsv")
Make named vector for marker genes to symbols.
gene_to_id <- genenames[,1]
names(gene_to_id) <- make.unique(genenames[,2])
markers$ID <- gene_to_id[markers$Marker]
markers
Marker Cell.Type ID
1 CD79A B cells ENSG00000105369
2 MS4A1 B cells ENSG00000156738
3 BCL11A B cells ENSG00000119866
4 IFITM3 Monocytes ENSG00000142089
5 PSAP Monocytes ENSG00000197746
6 CST3 Monocytes ENSG00000101439
7 LYZ Monocytes ENSG00000090382
8 GNLY NK cells ENSG00000115523
9 FCGR3A NK cells ENSG00000203747
10 CD3D T cells ENSG00000167286
11 CD3E T cells ENSG00000198851
12 CD3G T cells ENSG00000160654
13 SELL T cells ENSG00000188404
Use DotPlot from Seurat to look at expression levels per cluster.
# get unique list of genes
DotPlot(sc_subset, features=markers$ID, group.by="RNA_snn_res.0.5" ) + RotatedAxis()
The DotPlot would be more useful if we could label the x-axis with gene symbols and cell type descriptions. Let’s do that.
# format the marker names as "gene: cell type"
# - this will make a named vector
# - the values are the labels to plot
# - the names are the gene IDs
markers_description = paste0(markers$Marker,": ", markers$Cell.Type)
names(markers_description) = markers$ID
markers_description
ENSG00000105369 ENSG00000156738 ENSG00000119866 ENSG00000142089
"CD79A: B cells" "MS4A1: B cells" "BCL11A: B cells" "IFITM3: Monocytes"
ENSG00000197746 ENSG00000101439 ENSG00000090382 ENSG00000115523
"PSAP: Monocytes" "CST3: Monocytes" "LYZ: Monocytes" "GNLY: NK cells"
ENSG00000203747 ENSG00000167286 ENSG00000198851 ENSG00000160654
"FCGR3A: NK cells" "CD3D: T cells" "CD3E: T cells" "CD3G: T cells"
ENSG00000188404
"SELL: T cells"
# Add custom x labels to our plot
library(ggplot2)
DotPlot(sc_subset, features=markers$ID, group.by="RNA_snn_res.0.5" ) +
RotatedAxis() +
scale_x_discrete(labels=function(x) markers_description)
You can evaluate the markers manually to make an assessment of the cell type per cluster. For more examples of visualizations of these markers, and an automated approach for assigning cell types, look at take-home exercise 3.2.
We need an external single-cell data set whose cell types we trust. This is typically from published data sets, and we need to find a source where the authors have published not only the gene expression matrix files (as required), but also the cell type metadata (not always included).
Then you would load these files into a Seurat object, run normalization as above, and add the cell types to the single-cell metadata.
For this exercise, we are providing data from:
Single-cell RNA-Seq analysis reveals cell subsets and gene signatures associated with rheumatoid arthritis disease activity, Binvignat et al., JCI Insight, 2024 Aug 22; 9(16): e178499.
# read in Seurat object with data
sc_ref <- readRDS(url("https://wd.cri.uic.edu/scrna/sc_reference.rds"))
# run normalization and find variable features
sc_ref <- NormalizeData(sc_ref)
sc_ref <- FindVariableFeatures(sc_ref, nfeatures=4000)
# confirm that the gene identifiers match, and subset to genes in common
head(rownames(sc_ref))
[1] "ENSG00000186092" "ENSG00000238009" "ENSG00000239945" "ENSG00000241599"
[5] "ENSG00000229905" "ENSG00000237491"
head(rownames(sc_subset))
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000237683"
common.genes <- intersect(rownames(sc_ref),rownames(sc_subset))
length(common.genes)
[1] 20667
# cell types we can transfer
table(sc_ref@meta.data$rough_annot)
Bcells CD4Tcells CD8Tcells Monocytes NKcells
5776 21333 3142 9497 3846
table(sc_ref@meta.data$fine_annot)
CD4 T central memory CD4 T effector memory CD4 T IFIT
8739 5049 239
CD4 T Naive yd T cells CD8 T early Tem
6724 582 915
CD8 T Naive CD8 TEMRA Classical Monocytes
1164 1063 3653
IFITM3 Monocytes IL1b-Monocytes Myeloid DCs
246 2280 2429
Non-classical Monocytes NKCD56bright NKCD56low
889 3039 807
Memory Bcells Naive Bcells Plasmablasts
2462 2739 575
We will use the rough_annot for now, but you can repeat
the same process with fine_annot if you want.
# run the label transfer (transfer anchors) with Seurat
anchors <- FindTransferAnchors(reference=sc_ref, query=sc_subset, dims=1:30)
Performing PCA on the provided reference using 3776 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 5948 anchors
predictions <- TransferData(anchorset=anchors, refdata=sc_ref@meta.data$rough_annot)
Finding integration vectors
Finding integration vector weights
Predicting cell labels
# compute the majority predicted cell type per cluster
predictions$Cluster <- sc_subset@meta.data$RNA_snn_res.0.5
prediction_summary <- table(predictions[,c("Cluster","predicted.id")])
prediction_summary
predicted.id
Cluster Bcells CD4Tcells CD8Tcells Monocytes NKcells
0 0 1255 4 0 0
1 0 941 145 0 0
2 0 13 855 0 73
3 0 696 94 0 0
4 0 579 1 0 0
5 1 495 20 0 0
6 0 506 4 0 0
7 0 0 0 421 0
8 0 1 0 0 413
9 247 0 0 0 0
10 228 0 0 0 0
11 0 0 0 35 0
prediction_type <- apply(prediction_summary,1,function(x){
colnames(prediction_summary)[which.max(x)]})
prediction_type
0 1 2 3 4 5
"CD4Tcells" "CD4Tcells" "CD8Tcells" "CD4Tcells" "CD4Tcells" "CD4Tcells"
6 7 8 9 10 11
"CD4Tcells" "Monocytes" "NKcells" "Bcells" "Bcells" "Monocytes"
# assign these cell types in the Seurat object
sc_subset@meta.data$CellType <- prediction_type[sc_subset@meta.data$RNA_snn_res.0.5]
DimPlot(sc_subset, group.by = "CellType")
# compare to the clusters
DimPlot(sc_subset, reduction='umap', label=T)
Run differential expression between clusters using the Wilcox test, running comparisons as one cluster vs all other cells. Only test genes that are expressed in at least 20% of cells (default is 1%).
degs <- FindAllMarkers(sc_subset, test.use="wilcox", min.pct=0.2)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
head(degs)
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster
ENSG00000145425 2.836137e-250 0.5026214 1.000 1.000 9.284946e-246 0
ENSG00000196154 1.550552e-233 -2.3296592 0.604 0.872 5.076198e-229 0
ENSG00000122026 6.752797e-233 0.4360492 1.000 1.000 2.210731e-228 0
ENSG00000166710 1.552457e-229 -0.6831864 1.000 1.000 5.082434e-225 0
ENSG00000144713 2.897885e-217 0.4282546 1.000 1.000 9.487095e-213 0
ENSG00000163682 2.439887e-196 0.4691579 1.000 1.000 7.987701e-192 0
gene
ENSG00000145425 ENSG00000145425
ENSG00000196154 ENSG00000196154
ENSG00000122026 ENSG00000122026
ENSG00000166710 ENSG00000166710
ENSG00000144713 ENSG00000144713
ENSG00000163682 ENSG00000163682
In this table
p_val is from Wilcox testavg_log2FC is the log2 fold-change between cells in
this cluster and all other cellspct.1 is the fraction (out of 1) of cells in this
cluster that express the genepct.2 is the fraction of all other cells that express
the genep_val_adj is the adjusted p-value from
Bonferroni (see note below)cluster is the cluster in question; the other clusters
will appear farther down this tablegene is the gene in questionNOTE on p_val_adj: These p-values are biased, as the
clustering analysis intentionally divides cells based on differences in
their transcriptome. They are also biased because this function
pre-filters genes to test based on a minimum log2FC. Therefore, we fully
expect to identify a large number of “significant” differences by
design. It’s hard to correctly adjust for the expected distribution of
p-values in this circumstance, so Seurat uses the overly conservative
Bonferroni correction to try to over-adjust in a very rough
sense.
Add in gene symbols, which we can find in any of the features files from CellRanger, or in the Seurat object’s RNA feature metadata (if you included it).
Get ID to gene symbol conversion table, this time using the IDs as rownames.
genenames <- sc_subset[["RNA"]]@meta.data["gene.symbol"]
rownames(genenames) = Features(sc_subset)
## or
genenames <- read.table("https://wd.cri.uic.edu/scrna/features.tsv",row.names=1)
Add gene symbols to DEGs table.
degs$symbol <- genenames[degs$gene, 1]
head(degs)
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster
ENSG00000145425 2.836137e-250 0.5026214 1.000 1.000 9.284946e-246 0
ENSG00000196154 1.550552e-233 -2.3296592 0.604 0.872 5.076198e-229 0
ENSG00000122026 6.752797e-233 0.4360492 1.000 1.000 2.210731e-228 0
ENSG00000166710 1.552457e-229 -0.6831864 1.000 1.000 5.082434e-225 0
ENSG00000144713 2.897885e-217 0.4282546 1.000 1.000 9.487095e-213 0
ENSG00000163682 2.439887e-196 0.4691579 1.000 1.000 7.987701e-192 0
gene symbol
ENSG00000145425 ENSG00000145425 RPS3A
ENSG00000196154 ENSG00000196154 S100A4
ENSG00000122026 ENSG00000122026 RPL21
ENSG00000166710 ENSG00000166710 B2M
ENSG00000144713 ENSG00000144713 RPL32
ENSG00000163682 ENSG00000163682 RPL9
Count the number of up-regulated genes per cluster
(p_val_adj < 0.05 and at least two-fold
up-regulated).
table(degs[degs$p_val_adj<0.05 & degs$avg_log2FC>1, "cluster"])
0 1 2 3 4 5 6 7 8 9 10 11
7 11 84 6 21 48 23 705 359 155 202 603
Get top 5 up-regulated genes per cluster. The group_by()
function will group the data.frame rows by the cluster
column and the slice_max() function will return the top n
(10) rows ranked by the column avg_log2FC.
library(dplyr)
top5 <- degs %>% group_by(cluster) %>% slice_max(n=5, order_by=avg_log2FC)
top5
# A tibble: 60 × 8
# Groups: cluster [12]
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene symbol
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> <chr>
1 1.46e-119 1.83 0.4 0.133 4.79e-115 0 ENSG00000189283 FHIT
2 3.85e- 62 1.64 0.258 0.092 1.26e- 57 0 ENSG00000182463 TSHZ2
3 1.65e- 50 1.41 0.303 0.139 5.40e- 46 0 ENSG00000179862 CITED4
4 3.02e- 27 1.18 0.225 0.117 9.90e- 23 0 ENSG00000134352 IL6ST
5 2.12e- 32 1.15 0.268 0.141 6.95e- 28 0 ENSG00000186854 TRABD2A
6 6.09e-260 2.92 0.52 0.108 1.99e-255 1 ENSG00000160789 LMNA
7 1.65e-155 1.80 0.516 0.16 5.40e-151 1 ENSG00000129824 RPS4Y1
8 1.49e- 96 1.30 0.643 0.348 4.87e- 92 1 ENSG00000118503 TNFAIP3
9 7.02e- 43 1.30 0.323 0.153 2.30e- 38 1 ENSG00000273319 RP11-138A…
10 9.97e- 33 1.20 0.324 0.175 3.26e- 28 1 ENSG00000090104 RGS1
# ℹ 50 more rows
Quick visualizations:
top5_cluster5 <- top5[top5$cluster==5,]
VlnPlot(sc_subset, features=top5_cluster5$gene, pt.size=F)
NOTE! By default, Seurat will only scale the
variable features identified by FindVariableFeatures().
DoHeatmap(sc_subset, features=top5$gene)
See take-home exercise 3.3 for more examples of visualizations and differential expression between clusters, including using gene symbols in the plots.
logfc.threshold = 0 to test ALL genes with
sufficient minimum expression. This avoids biasing the p-values by only
testing genes with a minimum difference. Since we are testing
within a cluster, the biases in transcriptome due to clustering
are not an issue.NOTE: use a loop in R to perform this analysis over all clusters.
# example analyzing for cluster 1
sc_cluster1 = subset(sc_subset, subset = seurat_clusters == 1)
Idents(sc_cluster1) = "group"
table(Idents(sc_cluster1))
Smoker Non-Smoker
638 448
cluster1_stats <- FindMarkers(sc_cluster1, test.use="wilcox",
ident.1="Smoker", ident.2="Non-Smoker", logfc.threshold=0)
cluster1_stats$p_val_adj <- p.adjust(cluster1_stats$p_val, method="fdr")
cluster1_stats$symbol <- genenames[rownames(cluster1_stats), 1]
head(cluster1_stats)
p_val avg_log2FC pct.1 pct.2 p_val_adj symbol
ENSG00000197728 6.293938e-42 0.5843158 1.000 0.984 6.271909e-38 RPS26
ENSG00000234745 1.665232e-35 -0.4939781 1.000 1.000 8.297017e-32 HLA-B
ENSG00000124614 2.406454e-22 -0.3337910 0.998 1.000 7.993437e-19 RPS10
ENSG00000170315 5.136330e-22 -0.5637031 0.958 0.984 1.279588e-18 UBB
ENSG00000244734 2.882615e-20 4.8212828 0.251 0.040 5.745051e-17 HBB
ENSG00000100097 6.073084e-20 1.2267216 0.618 0.364 1.008638e-16 LGALS1
# count number of significant changes
table(cluster1_stats$p_val_adj < 0.05)
FALSE TRUE
9817 148
# heatmap visualization
degs1 <- rownames(subset(cluster1_stats, p_val_adj<0.05))
norm <- GetAssayData(sc_cluster1, layer="data")
norm <- as.matrix(norm[degs1,])
norm <- t(scale(t(norm)))
library(ComplexHeatmap)
library(circlize)
col_annot <- HeatmapAnnotation(df=sc_cluster1@meta.data["group"])
col_fun <- colorRamp2(c(-2,0,2),c("blue","white","red"))
Heatmap(norm, name="Z-score", top_annotation=col_annot, col=col_fun,
show_column_names=F, row_labels=genenames[rownames(norm),1])
Note: if you save the heatmap to a pdf with height=20
then the gene names will be legible.
See take-home exercise 3.3 for an example of a pseudo-bulk analysis for the same cluster.
Count the cells per group per cluster, and analyze in edgeR.
Skip the TMM calculation so that the normalized values can be expressed as percent.
cell_counts <- table(sc_subset@meta.data[,c("seurat_clusters","orig.ident")])
# obtain metadata and match column order
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
metadata <- metadata[colnames(cell_counts),]
# analysis in edgeR
library(edgeR)
Loading required package: limma
counts_dge <- DGEList(cell_counts, group=metadata)
counts_dge <- estimateDisp(counts_dge)
Using classic mode.
# check the BCV: it is almost always much higher for cell type abundance comparisons
sqrt(counts_dge$common.disp)
[1] 1.001724
counts_results <- exactTest(counts_dge)$table
counts_results$QValue <- p.adjust(counts_results$PValue)
counts_results
logFC logCPM PValue QValue
0 -1.0981006 17.45593 0.359683196 1.00000000
1 0.5565322 17.27211 0.666302586 1.00000000
2 0.7411713 17.05163 0.291838428 1.00000000
3 0.4165837 16.81623 0.667464155 1.00000000
4 -0.7736547 16.36955 0.661947851 1.00000000
5 5.1668568 16.13844 0.002807115 0.03368538
6 -2.2121969 16.17710 0.008722238 0.09594462
7 -0.8012039 15.92089 0.305316762 1.00000000
8 -0.4479417 15.90518 0.538818524 1.00000000
9 0.4979121 15.18609 0.541651722 1.00000000
10 0.4187248 15.07397 0.553495935 1.00000000
11 0.7830817 12.83111 0.459477230 1.00000000
It looks like clusters 5 and 6 show fairly significant differences (under a more lenient Q<0.1) in relative abundance between groups (note that cluster 0 has >2-fold change, but is not significant beceause of the high variability). Make a boxplot of the significant differences.
percent <- cpm(counts_dge)
# convert to percent (divide CPM by a million)
percent <- percent/1000000
# look at significant changes, add group info
percent_signif <- data.frame(t(percent[counts_results$QValue<0.1,]),
group=metadata,check.names=F)
percent_signif
5 6 group
V035_F031 0.013422819 0.025727069 Smoker
V039_F093 0.001122334 0.190796857 Non-Smoker
V040_F198 0.522210184 0.017334778 Smoker
V041_F043 0.005543237 0.119733925 Non-Smoker
V043_F047 0.002275313 0.043230944 Non-Smoker
V044_F037 0.004830918 0.049516908 Smoker
V045_F120 0.005720824 0.121281465 Non-Smoker
V054_F209 0.005980861 0.009569378 Smoker
# make boxplot
library(ggplot2)
library(tidyr)
Attaching package: 'tidyr'
The following objects are masked from 'package:Matrix':
expand, pack, unpack
percent_signif_long <- pivot_longer(percent_signif,
cols=-group, names_to="cluster", values_to="abundance")
percent_signif_long
# A tibble: 16 × 3
group cluster abundance
<chr> <chr> <dbl>
1 Smoker 5 0.0134
2 Smoker 6 0.0257
3 Non-Smoker 5 0.00112
4 Non-Smoker 6 0.191
5 Smoker 5 0.522
6 Smoker 6 0.0173
7 Non-Smoker 5 0.00554
8 Non-Smoker 6 0.120
9 Non-Smoker 5 0.00228
10 Non-Smoker 6 0.0432
11 Smoker 5 0.00483
12 Smoker 6 0.0495
13 Non-Smoker 5 0.00572
14 Non-Smoker 6 0.121
15 Smoker 5 0.00598
16 Smoker 6 0.00957
ggplot(percent_signif_long, aes(x=cluster, y=abundance, fill=group)) +
geom_boxplot()
NOTE: DO NOT PERFORM THE FOLLOWING COMMANDS
This is an example of the R commands to load 10X data with both gene expression (GEX) and feature barcoding (FBC) data into Seurat. For this workshop we will be providing a loaded, filtered, and clustered R data object (.rds file). So, skip ahead to Basic visualization of FBC data for the actual exercise.
Read10X
function.data_w_fbc <- Read10X("Sample_A_matrix/", gene.column=1)
NOTE: When the data are loaded by Read10X, it will create a
R list with elements for both the gene expression (GEX)
data and the feature barcoding (FBC) data. You can use the following
command to view the elements in this list. This is not necessary if you
already know the type of FBC data included.
names(data_w_fbc)
sc_w_fbc <- CreateSeuratObject(counts = data_w_fbc[['Gene Expression']])
sc_w_fbc[["Protein"]] <- CreateAssayObject(counts = data_w_fbc[["Antibody Capture"]])
For this exercise we will be providing a loaded, filtered, and clustered R data object. The original data are an example dataset from 10X (https://www.10xgenomics.com/resources/datasets/human-pbmc-from-a-healthy-donor-1-k-cells-v-2-2-standard-4-0-0).
For this Seurat object we had performed the following.
nFeature_RNA > 1000 & nCount_RNA > 3000 & percent.MT < 10.NormalizeData().FindVariableFeatures().RunPCA().FindNeighbors() using the first 10 PCA
dimensions.FindClusters() with a resolution
of 1.sc_w_fbc <- readRDS(url("https://wd.cri.uic.edu/scrna/sc_fbc.rds"))
DimPlot(sc_w_fbc, reduction='umap', label=T)
3.
List the features included in the FBC (Protein) assay.
row.names(sc_w_fbc[["Protein"]])
[1] "CD3" "CD19" "CD45RA" "CD4" "CD8a"
[6] "CD14" "CD16" "CD56" "CD25" "CD45RO"
[11] "PD-1" "TIGIT" "IgG1" "IgG2a" "IgG2b"
[16] "CD127" "CD15" "CD197-(CCR7)" "HLA-DR"
. Generate violin plots including FBC data and grouped by cluster.
VlnPlot(sc_w_fbc,
features = c("nFeature_RNA","nCount_RNA", "nCount_Protein"),
pt.size=0.2)
5.
Normalize the FBC data using a center log ratio (CLR) normalization.
sc_w_fbc <- NormalizeData(sc_w_fbc, assay="Protein", normalization.method="CLR")
Normalizing across features
DefaultAssay(sc_w_fbc)
[1] "RNA"
DefaultAssay(sc_w_fbc) <- "Protein"
DefaultAssay(sc_w_fbc)
[1] "Protein"
row.names(sc_w_fbc[["Protein"]])
[1] "CD3" "CD19" "CD45RA" "CD4" "CD8a"
[6] "CD14" "CD16" "CD56" "CD25" "CD45RO"
[11] "PD-1" "TIGIT" "IgG1" "IgG2a" "IgG2b"
[16] "CD127" "CD15" "CD197-(CCR7)" "HLA-DR"
sel_ab <- c("CD3", "CD4", "CD8a", "IgG1")
FeaturePlot(sc_w_fbc, features=sel_ab, label=T)
8. Create dot plot of FBC features
DotPlot(sc_w_fbc, features=sel_ab)
VlnPlot(sc_w_fbc, features=sel_ab)
sc_w_fbc <- ScaleData(sc_w_fbc, assay="Protein")
Centering and scaling data matrix
DoHeatmap(sc_w_fbc, features=sel_ab)
DoHeatmap(sc_w_fbc, features=row.names(sc_w_fbc[["Protein"]]))
sel_features <- c("protein_CD4", "rna_ENSG00000010610",
"protein_CD8a", "rna_ENSG00000153563")
FeaturePlot(sc_w_fbc, features = sel_features, label=T)
3. Generate dot plots.
DotPlot(sc_w_fbc, features = sel_features) + RotatedAxis()
4. Generate violin plots.
VlnPlot(sc_w_fbc, features = sel_features, ncol=2)
fbc_presence <- t(sc_w_fbc[["Protein"]]@data > 1)
cluster_ids <- data.frame(Cluster=sc_w_fbc$seurat_clusters)
fbc_presence <- merge(cluster_ids, fbc_presence, by.x=0, by.y=0)
cd4_counts <- table(fbc_presence[,c("Cluster", "CD4")])
cd4_counts
CD4
Cluster FALSE TRUE
0 10 213
1 55 137
2 6 87
3 91 0
4 80 4
5 37 1
6 31 6
7 31 0
8 7 9
cd4_counts_df <- as.data.frame(cd4_counts)
cd4_counts <- merge(subset(cd4_counts_df, CD4 == "TRUE", select=c(Cluster, Freq)),
subset(cd4_counts_df, CD4 == "FALSE", select=c(Cluster, Freq)),
by.x=1, by.y=1)
colnames(cd4_counts)[2:3] <- c("Present", "Absent")
cd4_counts$Fraction <- cd4_counts$Present / ( cd4_counts$Present + cd4_counts$Absent )
cd4_counts
Cluster Present Absent Fraction
1 0 213 10 0.95515695
2 1 137 55 0.71354167
3 2 87 6 0.93548387
4 3 0 91 0.00000000
5 4 4 80 0.04761905
6 5 1 37 0.02631579
7 6 6 31 0.16216216
8 7 0 31 0.00000000
9 8 9 7 0.56250000
Information about the format of the
filtered_contig_annotations.csv file can be found at https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation#contig.
For this exercise, the following contig annotations were provided.
| Column | Description |
|---|---|
| barcode | Cell-barcode for this contig. |
| is_cell | True or False value indicating whether the barcode was called as a cell. |
| contig_id | Unique identifier for this contig. |
| high_confidence | True or False value indicating whether the contig was called as high-confidence (unlikely to be a chimeric sequence or some other artifact). |
| length | The contig sequence length in nucleotides. |
| chain | The chain associated with this contig; for example, TRA, TRB, IGK, IGL, or IGH. A value of “Multi” indicates that segments from multiple chains were present. |
| v_gene | The highest-scoring V segment, for example, TRAV1-1. |
| d_gene | The highest-scoring D segment, for example, TRBD1. |
| j_gene | The highest-scoring J segment, for example, TRAJ1-1. |
| c_gene | The highest-scoring C segment, for example, TRAC. |
| full_length | If the contig was declared as full-length. |
| productive | If the contig was declared as productive. |
| cdr3 | The predicted CDR3 amino acid sequence. |
| cdr3_nt | The predicted CDR3 nucleotide sequence. |
| reads | The number of reads aligned to this contig. |
| umis | The number of distinct UMIs aligned to this contig. |
| raw_clonotype_id | The ID of the clonotype to which this cell barcode was assigned. |
| raw_consensus_id | The ID of the consensus sequence to which this contig was assigned. |
We will start with a very basic analysis: check which cells have TRA or TRB chains. Depending on the goals of your project, you will also want to dig into the specific CDR3 sequences to compare between cells or between clusters.
Load the contig annotations into R.
sc_vdj_contigs <- read.csv("https://wd.cri.uic.edu/scrna/tcr_contigs.csv")
Clean up the barcode IDs, so they match the IDs in the Seurat object (missing the trailing numbers).
sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)
Check how many cells have TRA or TRB or both.
# start with all of the cells
all_cells <- sc_w_fbc@meta.data["seurat_clusters"]
all_cells$TRA <- rownames(all_cells) %in% sc_vdj_contigs$barcode[sc_vdj_contigs$chain=="TRA"]
all_cells$TRB <- rownames(all_cells) %in% sc_vdj_contigs$barcode[sc_vdj_contigs$chain=="TRB"]
all_cells$TRAB <- all_cells$TRA & all_cells$TRB
head(all_cells)
seurat_clusters TRA TRB TRAB
AAACCTGCAGCCTGTG 6 FALSE FALSE FALSE
AAACGGGTCGCTGATA 2 TRUE TRUE TRUE
AAAGCAAAGTATGACA 2 TRUE TRUE TRUE
AAATGCCCACTTAAGC 0 TRUE TRUE TRUE
AACACGTCAACAACCT 0 TRUE TRUE TRUE
AACACGTCAAGCGATG 1 FALSE FALSE FALSE
Summary stats of complete A and B chains per cluster.
table(all_cells[,c("seurat_clusters","TRAB")])
TRAB
seurat_clusters FALSE TRUE
0 19 204
1 191 1
2 5 88
3 9 82
4 84 0
5 4 34
6 12 25
7 31 0
8 16 0
NOTE: This exercise will extract the UMAP plot data from the Seurat object and create a custom plot. It is also possible to add the TCR data to the Seurat object as a secondary assay, akin to the FBC data, and use the Seurat tools.
Get the UMAP coordinates as a data.frame.
umap_coords <- as.data.frame(Embeddings(sc_w_fbc, reduction="umap"))
# Take a peek at the results
head(umap_coords)
umap_1 umap_2
AAACCTGCAGCCTGTG -7.553978 2.770488
AAACGGGTCGCTGATA -5.912212 -1.226433
AAAGCAAAGTATGACA -6.879341 -0.258912
AAATGCCCACTTAAGC -6.659776 -5.674999
AACACGTCAACAACCT -4.886473 -5.689088
AACACGTCAAGCGATG 9.418021 -2.538629
Merge the UMAP coordinates and chain information.
plot_data <- merge(all_cells, umap_coords, by=0)
# Take a peek at the results
head(plot_data)
Row.names seurat_clusters TRA TRB TRAB umap_1 umap_2
1 AAACCTGCAGCCTGTG 6 FALSE FALSE FALSE -7.553978 2.770488
2 AAACGGGTCGCTGATA 2 TRUE TRUE TRUE -5.912212 -1.226433
3 AAAGCAAAGTATGACA 2 TRUE TRUE TRUE -6.879341 -0.258912
4 AAATGCCCACTTAAGC 0 TRUE TRUE TRUE -6.659776 -5.674999
5 AACACGTCAACAACCT 0 TRUE TRUE TRUE -4.886473 -5.689088
6 AACACGTCAAGCGATG 1 FALSE FALSE FALSE 9.418021 -2.538629
Create the basic plot, UMAP plot colored by chain information. The
addition of
guides(color=guide_legend(override.aes = list(size=3)))
will set the size of the plot character in the legend to be a different
size than the plot characters used in the plot.
library(ggplot2)
ggplot(plot_data, aes(x=umap_1, y=umap_2, color=TRAB)) +
geom_point(size=0.5) + theme_classic() +
guides(color=guide_legend(override.aes = list(size=3)))
ggplot(plot_data, aes(x=umap_1, y=umap_2, color=seurat_clusters)) +
geom_point(size=0.5) + theme_classic() +
guides(color=guide_legend(override.aes = list(size=3)))
NOTE: Additional exercise of incorporating V(D)J data into Seurat is found in Extra (Take Home) Exercises.
The following packages are needed for the take-home exercises:
scds and SingleCellExperiment are
needed to complete the doublet removal exercise we skipped (Exercise
1.3)BiocManager::install("scds", update=F)
BiocManager::install("SingleCellExperiment", update=F)
BiocManager::install("monocle", update=F)
install.packages('fossil')
install.packages('cowplot')
install.packages('vegan')
devtools is installed (needed for
presto before).if ( ! requireNamespace("devtools"))
install.packages("devtools")
devtools::install_local("PATH/TO/DOWNLOAD/CytoTRACE_0.3.3.tar.gz")
NOTE: If CytoTRACE could not be installed because
ERROR: dependency 'sva' is not available for package 'CytoTRACE',
install “sva” using BiocManager.
BiocManager::install("sva", update=F)
library(cowplot)
library(CytoTRACE)
library(fossil)
library(monocle)
library(scds)
library(SingleCellExperiment)
library(vegan)
We can use the same object, as each resolution will get saved into its own slot.
sc_subset <- FindClusters(sc_subset, resolution=0.1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 7027
Number of edges: 255224
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9473
Number of communities: 6
Elapsed time: 0 seconds
sc_subset <- FindClusters(sc_subset, resolution=1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 7027
Number of edges: 255224
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8393
Number of communities: 15
Elapsed time: 0 seconds
table(sc_subset@meta.data$RNA_snn_res.0.1)
0 1 2 3 4 5
3348 2336 475 422 412 34
table(sc_subset@meta.data$RNA_snn_res.0.5)
0 1 2 3 4 5 6 7 8 9 10 11
1259 1086 941 790 580 516 510 421 414 247 228 35
table(sc_subset@meta.data$RNA_snn_res.1)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1061 865 804 575 529 486 466 421 411 395 333 247 228 171 35
fossil librarylibrary(fossil)
clust0.1 <- sc_subset@meta.data$RNA_snn_res.0.1
clust0.5 <- sc_subset@meta.data$RNA_snn_res.0.5
clust1 <- sc_subset@meta.data$RNA_snn_res.1
adj.rand.index(clust0.1, clust0.5)
[1] 0.6545382
adj.rand.index(clust0.1, clust1)
[1] 0.6578186
adj.rand.index(clust0.5, clust1)
[1] 0.8855797
NOTE: If you ended up with a large number of clustering results and you wanted to compute pair-wise similarities across all of them using the adjusted rand index, you would approach this by setting up a data frame with all of the clustering results in it across the columns, then running a loop over all pairs of columns.:
# building a data frame for our 3 clustering results
cluster.df <- data.frame(clust0.1, clust0.5, clust1)
# define a function to make a similarity matrix
cluster_similarity <- function( clust_df ){
# number of columns we're comparing
columns <- ncol(clust_df)
# set up a similarity matrix
rand.sim <- matrix( nrow=columns, ncol=columns )
rownames(rand.sim) = colnames(clust_df)
colnames(rand.sim) = colnames(clust_df)
# set all values to be 1 first
# this will keep the diagonal entries 1 after we do the other calculations
rand.sim[,] <- 1
# loop over all pairs of columns
for( i in 1:(columns-1) ){
for( j in (i+1):columns ){
# compute the similarity and store it at positions i,j and j,i
sim <- adj.rand.index( clust_df[,i], clust_df[,j] )
rand.sim[i,j] <- sim
rand.sim[j,i] <- sim
}
}
return(rand.sim)
}
# run the function
cluster.sim <- cluster_similarity(cluster.df)
cluster.sim
The goal is to compare the actual sets of cells in each clustering result, so that we can see which clusters between the results are related to each other. As a first pass, we can evaluate this qualitatively with two UMAP plots.
DimPlot(sc_subset, reduction='umap', group.by="RNA_snn_res.0.1", label=T)
DimPlot(sc_subset, reduction='umap', group.by="RNA_snn_res.1", label=T)
Quantitative comparison:
#), but
they are there to explain our steps along the way.Source this function to save time:
source("https://wd.cri.uic.edu/scrna/overlap_index.R")
Here is its definition, if you wanted to write it out:
# function for overlap index
overlap_index <- function( cluster1, cluster2 ){
# make factor vectors from clusters
c1 = as.factor(cluster1)
c2 = as.factor(cluster2)
# define a set of "names" for our cells, which we'll compare between clusters
# the names are arbitraty, so we'll just number them
names = 1:length(c1)
# make distance matrix to store our calculations in
my_dist = matrix(nrow=length(levels(c1)),ncol=length(levels(c2)))
rownames(my_dist) = levels(c1)
colnames(my_dist) = levels(c2)
for(i in 1:length(levels(c1))){
for(j in 1:length(levels(c2))){
# get the list of cells in each cluster
c1.sub = names[c1==levels(c1)[i]]
c2.sub = names[c2==levels(c2)[j]]
# get the intersection and min cluster size
int = length(intersect(c1.sub,c2.sub))
minsize = min(length(c1.sub),length(c2.sub))
# overlap index
overlap = int/minsize
# store in matrix
my_dist[i,j] = overlap
}
}
return(my_dist)
}
Now run the function
res_0.1vs1 <- overlap_index(clust0.1, clust1)
res_0.1vs1
0 1 2 3 4 5 6 7 8
0 0.9990574929 0.01849711 0.9738806 0.003478261 0.992438563 0.997942387 1 0 0
1 0.0009425071 0.98150289 0.0261194 0.996521739 0.007561437 0.000000000 0 0 0
2 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 0 0
3 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 1 0
4 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.002427184 0 0 1
5 0.0000000000 0.00000000 0.0000000 0.000000000 0.000000000 0.000000000 0 0 0
9 10 11 12 13 14
0 0.02531646 0 0 0 0.005847953 0.00000000
1 0.97468354 1 0 0 0.994152047 0.00000000
2 0.00000000 0 1 1 0.000000000 0.00000000
3 0.00000000 0 0 0 0.000000000 0.02857143
4 0.00000000 0 0 0 0.000000000 0.00000000
5 0.00000000 0 0 0 0.000000000 1.00000000
library(ComplexHeatmap)
Heatmap(res_0.1vs1,
col = c("white","yellow","red"),
name = "Overlap Index",
column_title = "Resolution 1",
row_title = "Resolution 0.1")
We may also find it useful to look at the expression of these markers on the UMAP plot. Here’s how we can use Seurat’s FeaturePlot function, but add in our own custom labels:
library(cowplot)
# run FeaturePlot with combine=F so we get all separate plots
# 'myplots' will be a list of ggplot objects
myplots <- FeaturePlot(sc_subset, markers$ID, combine=F)
# for each plot in the list, add a new title with the description
for(i in 1:length(myplots)){
myplots[[i]] = myplots[[i]] + ggtitle(markers_description[i])
}
# then make a new combined plot with plot_grid
plot_grid(plotlist=myplots, ncol=3)
scType can be run by simply sourcing a function from their GitHub page.
The database for scType is a named list of marker genes for each cell type, which we will make from our marker list above. You can also provide negative markers in a separate list if desired.
For more information, see the scType documentation: https://sctype.app.
# source the scType functions from github
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# prepare gene sets using the marker list above
celltypes <- unique(markers$Cell.Type)
celltypes_genes <- sapply(celltypes, function(x){
subset(markers, Cell.Type==x)$ID})
# this is a named list of gene markers for each cell type
celltypes_genes
$`B cells`
[1] "ENSG00000105369" "ENSG00000156738" "ENSG00000119866"
$Monocytes
[1] "ENSG00000142089" "ENSG00000197746" "ENSG00000101439" "ENSG00000090382"
$`NK cells`
[1] "ENSG00000115523" "ENSG00000203747"
$`T cells`
[1] "ENSG00000167286" "ENSG00000198851" "ENSG00000160654" "ENSG00000188404"
# also include CD3 genes as negative markers for NK cells
negative_markers <- list("NK cells"=markers$ID[grep("^CD3",markers$Marker)])
# obtain normalized expression for these genes
norm <- GetAssayData(sc_subset, layer="data")
norm <- as.matrix(norm[markers$ID,])
# run scType
# scaled=F means that we are supplying normalized, but not z-scored data
# and scType will run z-scoring then
# the "gs2" parameter is for negative markers (optional in general)
sctype_result <- sctype_score( norm, scaled=F,
gs=celltypes_genes, gs2=negative_markers )
# transpose and peek at results
sctype_result <- t(sctype_result)
head(sctype_result)
B cells Monocytes NK cells T cells
s1_AAACGGGAGACTGTAA-1 6.5444141 -0.2445015 2.310882 -1.9307517
s1_AAACGGGTCTCAAACG-1 -0.4459658 -0.8862130 -2.178354 1.6114778
s1_AAAGTAGGTAGCAAAT-1 -0.4459658 -0.8862130 -1.481604 1.3904256
s1_AAAGTAGTCCGCAAGC-1 -0.4459658 1.0578332 -2.198394 0.9068864
s1_AAATGCCTCTTGAGGT-1 -0.4459658 0.2940607 -1.348341 0.1707191
s1_AACACGTCAACTGCTA-1 -0.4459658 -0.3488254 -1.844020 1.3543735
# assign a cell type per cell based on the max score
celltypes <- apply(sctype_result, 1, function(x){
colnames(sctype_result)[which.max(x)]})
# count the number of each cell type per cluster
cluster_types <- data.frame(
Cluster = sc_subset@meta.data$RNA_snn_res.0.5,
Type = celltypes)
cluster_summary <- table(cluster_types)
cluster_summary
Type
Cluster B cells Monocytes NK cells T cells
0 46 95 272 846
1 55 99 246 686
2 42 84 275 540
3 34 104 128 524
4 17 62 84 417
5 24 41 59 392
6 32 46 81 351
7 2 417 2 0
8 0 1 411 2
9 237 0 10 0
10 221 0 7 0
11 0 23 12 0
# obtain the majority cell type per cluster
cluster_type <- apply(cluster_summary,1,function(x){
colnames(cluster_summary)[which.max(x)]})
cluster_type
0 1 2 3 4 5
"T cells" "T cells" "T cells" "T cells" "T cells" "T cells"
6 7 8 9 10 11
"T cells" "Monocytes" "NK cells" "B cells" "B cells" "Monocytes"
# assign these cell types in the Seurat object
sc_subset@meta.data$CellType <- cluster_type[sc_subset@meta.data$RNA_snn_res.0.5]
DimPlot(sc_subset, group.by = "CellType")
# compare to the clusters
DimPlot(sc_subset, reduction='umap', label=T)
Compare this to what you would have assigned based on the marker genes in Exercise 2.2, as well as to the label transfer results in that exercise.
Make nicer versions of these plots, with a little extra work.
Violin plots:
library(cowplot)
myplots <- VlnPlot(sc_subset, features=top5_cluster5$gene, combine=F)
for(i in 1:length(myplots)){
myplots[[i]] = myplots[[i]] + ggtitle(top5_cluster5$symbol[i])
}
plot_grid(plotlist=myplots)
Dotplots:
DotPlot(sc_subset, features=top5_cluster5$gene ) +
scale_x_discrete(labels=function(x) genenames[x,1])
Heatmap:
Option 1, using Seurat’s DoHeatmap (this will still omit genes that were not z-scaled - not part of the variable features).
DoHeatmap(sc_subset, features=top5$gene) +
scale_y_discrete(labels=function(x) genenames[x,1])
Option 2, using ComplexHeatmap
# get normalized expression data of interest
norm.data <- GetAssayData(sc_subset,layer="data")
norm.data <- norm.data[top5$gene,]
# z-score
norm.data <- t(scale(t(norm.data)))
# order the columns by cluster
columns.clusters <- sc_subset@meta.data$seurat_clusters
columns.order <- order(columns.clusters)
norm.data <- norm.data[,columns.order]
columns.clusters <- columns.clusters[columns.order]
# make heatmap
library(ComplexHeatmap)
library(circlize)
col_fun <- colorRamp2(breaks=c(-2,0,2),colors=c("blue","white","red"))
column_annotation <- HeatmapAnnotation(cluster=columns.clusters)
Heatmap(norm.data, name="Z-score", col=col_fun,
cluster_columns=F, show_column_names=F, column_split = columns.clusters,
cluster_rows=F, row_labels=top5$symbol)
FindAllMarkers) to test specific subsets of clusters.orig.ident.cluster1vs2 <- FindMarkers(sc_subset, ident.1=1, ident.2=2, group.by="seurat_clusters")
cluster1vs2and3 <- FindMarkers(sc_subset, ident.1=1, ident.2=c(2,3),
group.by="seurat_clusters")
# check we've assigned cell types from the label transfer analysis
sc_subset@meta.data$CellType <- prediction_type[sc_subset@meta.data$RNA_snn_res.0.5]
Idents(sc_subset) <- "CellType"
stats_between_types <- FindAllMarkers(sc_subset, test.use="wilcox")
Calculating cluster Bcells
Calculating cluster CD8Tcells
Calculating cluster CD4Tcells
Calculating cluster Monocytes
Calculating cluster NKcells
NOTE: use a loop in R to perform this analysis over all clusters.
library(tidyr)
sc_cluster1 <- subset(sc_subset, subset = seurat_clusters == 1)
samples <- unique(sc_cluster1$orig.ident)
counts <- GetAssayData(sc_cluster1,layer="counts")
cluster1_counts <- sapply(samples, function(x){
# boolean vector indicating which cells are in sample x
sample.cells <- sc_cluster1$orig.ident == x
# sum counts over all of these cells
sample.counts <- Matrix::rowSums(counts[,sample.cells])
return(sample.counts)
})
# this is the pseudo-bulk counts table for this cluster
head(cluster1_counts)
V035_F031 V039_F093 V040_F198 V041_F043 V043_F047 V044_F037
ENSG00000243485 0 0 0 0 0 0
ENSG00000237613 0 0 0 0 0 0
ENSG00000186092 0 0 0 0 0 0
ENSG00000238009 0 0 1 0 0 0
ENSG00000239945 0 0 0 0 0 0
ENSG00000237683 0 0 0 0 1 0
V045_F120 V054_F209
ENSG00000243485 0 0
ENSG00000237613 0 0
ENSG00000186092 0 0
ENSG00000238009 0 0
ENSG00000239945 0 0
ENSG00000237683 0 1
# now run differential expression statistics in edgeR
library(edgeR)
# obtain metadata and match column order
metadata <- read.delim("https://wd.cri.uic.edu/scrna/GSE138867_metadata.txt",row.names=1)
metadata <- metadata[colnames(cluster1_counts),]
# subset genes to those expressed
cluster1_counts_use = cluster1_counts[rowSums(cluster1_counts)>20,]
# prepare edgeR object
cluster1_dge <- DGEList(cluster1_counts_use, group=metadata)
# normalization and dispersion
cluster1_dge <- calcNormFactors(cluster1_dge)
cluster1_dge <- estimateDisp(cluster1_dge)
Using classic mode.
# check the BCV level
sqrt(cluster1_dge$common.dispersion)
[1] 0.2676988
# run differential stats
cluster1_result <- exactTest(cluster1_dge)$table
cluster1_result$QValue <- p.adjust(cluster1_result$PValue)
head(cluster1_result)
logFC logCPM PValue QValue
ENSG00000228463 -0.06724653 3.454524 1.0000000 1
ENSG00000230368 0.44547704 3.416732 1.0000000 1
ENSG00000188976 0.26854915 5.370629 0.7806707 1
ENSG00000187608 0.37810777 6.713287 0.3632735 1
ENSG00000186891 0.02712532 4.657649 0.4791354 1
ENSG00000186827 0.06737773 5.734380 0.6599663 1
# count how many are significant
table(cluster1_result$QValue < 0.05)
FALSE TRUE
7925 5
# comparison of DEGs here and from single-cell comparison
colnames(cluster1_result) <- paste0("Pseudobulk_",colnames(cluster1_result))
deg_comparison <- merge(cluster1_result, cluster1_stats, by=0)
subset(deg_comparison, Pseudobulk_QValue < 0.05)
Row.names Pseudobulk_logFC Pseudobulk_logCPM Pseudobulk_PValue
6829 ENSG00000188536 4.025540 7.579265 3.331712e-14
7308 ENSG00000206172 4.356953 6.600759 2.683781e-21
7565 ENSG00000236016 3.098049 5.191447 5.921193e-09
7652 ENSG00000244734 5.194182 8.778717 3.886196e-31
7733 ENSG00000255823 2.926671 5.027914 5.819543e-08
Pseudobulk_QValue p_val avg_log2FC pct.1 pct.2 p_val_adj
6829 2.641382e-10 5.914654e-16 4.155458 0.229 0.049 4.209966e-13
7308 2.127970e-17 3.328512e-03 3.770804 0.069 0.029 1.355351e-01
7565 4.693730e-05 3.224137e-11 2.957707 0.139 0.022 1.235712e-08
7652 3.081754e-27 2.882615e-20 4.821283 0.251 0.040 5.745051e-17
7733 4.612570e-04 2.997342e-16 3.260734 0.171 0.016 2.489043e-13
symbol
6829 HBA2
7308 HBA1
7565 RP1-3J17.3
7652 HBB
7733 MTRNR2L8
Seurat (needed to manipulate the previously
create Seurat object) and monocle for the pseudotime
analysis# load both libraries, if not done already
library(Seurat)
library(monocle)
sc_subset <- readRDS(url("https://wd.cri.uic.edu/scrna/clustered.rds"))
Monocle has a function called “importCDS” that is supposed to import from Seurat. But it doesn’t work. Fun. We have created a fixed version of the function, which you can source from public-data.
Pro tip: If you want to see the code of a function, just type its name and hit enter. E.g., type “importCDS” to see the code for this function in Monocle2. After you’ve sourced our R script, you can type “importCDS2” to see our version.
# source the R script
source("https://wd.cri.uic.edu/scrna/importCDS2.R")
# use the importCDS2 function (our function) instead of importCDS (monocle function)
monocle_data <- importCDS2(sc_subset)
Warning in .sparse2dense(e1): sparse->dense coercion: allocating vector of size
1.7 GiB
Warning in .sparse2dense(e2): sparse->dense coercion: allocating vector of size
1.7 GiB
monocle_data
CellDataSet (storageMode: environment)
assayData: 32738 features, 7027 samples
element names: exprs
protocolData: none
phenoData
sampleNames: s1_AAACGGGAGACTGTAA-1 s1_AAACGGGTCTCAAACG-1 ...
s8_TTTGTCAGTGACGGTA-1 (7027 total)
varLabels: orig.ident nCount_RNA ... Size_Factor (10 total)
varMetadata: labelDescription
featureData
featureNames: ENSG00000243485 ENSG00000237613 ... ENSG00000215611
(32738 total)
fvarLabels: gene_short_name
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
We need to tell Monocle which genes to base the analysis on. Monocle has a set of steps to do this, similar to how we picked the variable genes before. But we can just use the variable genes we’ve already selected.
monocle_data <- estimateSizeFactors(monocle_data)
# get variable genes from Seurat object
top.genes <- VariableFeatures(sc_subset)
monocle_data <- setOrderingFilter(monocle_data, top.genes)
monocle_data <- reduceDimension(monocle_data, max_components=2, method='DDRTree')
monocle_data <- orderCells(monocle_data)
plot_cell_trajectory(monocle_data)
plot_cell_trajectory(monocle_data, color_by = "Pseudotime")
plot_cell_trajectory(monocle_data, color_by = "RNA_snn_res.0.5")
The pData function from the monocle package will return
a data.frame with the computed psuedotime values. As the original Seurat
data were also imported, e.g. clusters, original idents.
head(pData(monocle_data))
orig.ident nCount_RNA nFeature_RNA group percent.MT
s1_AAACGGGAGACTGTAA-1 V035_F031 5201 1237 Smoker 3.653144
s1_AAACGGGTCTCAAACG-1 V035_F031 3909 1169 Smoker 3.223331
s1_AAAGTAGGTAGCAAAT-1 V035_F031 5026 1335 Smoker 1.929964
s1_AAAGTAGTCCGCAAGC-1 V035_F031 4087 1259 Smoker 4.061659
s1_AAATGCCTCTTGAGGT-1 V035_F031 5521 1292 Smoker 1.829379
s1_AACACGTCAACTGCTA-1 V035_F031 3614 804 Smoker 2.102933
IsDoublet RNA_snn_res.0.5 seurat_clusters Size_Factor
s1_AAACGGGAGACTGTAA-1 FALSE 10 10 1.3580499
s1_AAACGGGTCTCAAACG-1 FALSE 2 2 1.0206916
s1_AAAGTAGGTAGCAAAT-1 FALSE 0 0 1.3123551
s1_AAAGTAGTCCGCAAGC-1 FALSE 3 3 1.0671698
s1_AAATGCCTCTTGAGGT-1 FALSE 3 3 1.4416061
s1_AACACGTCAACTGCTA-1 FALSE 0 0 0.9436632
Pseudotime State
s1_AAACGGGAGACTGTAA-1 9.067058 1
s1_AAACGGGTCTCAAACG-1 18.841316 2
s1_AAAGTAGGTAGCAAAT-1 15.121057 3
s1_AAAGTAGTCCGCAAGC-1 19.718525 2
s1_AAATGCCTCTTGAGGT-1 15.586364 2
s1_AACACGTCAACTGCTA-1 19.458620 3
# Get the top marker gene for cluster 7
topgene = degs$gene[degs$cluster==7][1]
# plot using gene expression as size of dot
plot_cell_trajectory(monocle_data, markers = topgene, use_color_gradient=T)
monocle_data@reducedDimW has “whitened” cell coordinates in
the reduced dimension space. Analogous to the loadings from PCAhead(t(monocle_data@reducedDimS))
[,1] [,2]
s1_AAACGGGAGACTGTAA-1 5.8585574 2.6289045
s1_AAACGGGTCTCAAACG-1 -0.6354290 -2.8018766
s1_AAAGTAGGTAGCAAAT-1 -0.0216304 1.8470248
s1_AAAGTAGTCCGCAAGC-1 -0.4642292 -3.6755653
s1_AAATGCCTCTTGAGGT-1 -0.1627147 0.5634199
s1_AACACGTCAACTGCTA-1 -3.9413785 3.6908672
clusters <- sc_subset@meta.data$RNA_snn_res.0.5
pseudotime <- pData(monocle_data)$Pseudotime
cluster_vs_time <- data.frame(clusters, pseudotime)
ggplot(cluster_vs_time, aes(x=clusters, y=pseudotime)) +
geom_boxplot()
There is clearly a strong association of cluster vs time. Further tests may be helpful if you want to test the significance of which clusters come earlier or later with respect to pseudotime. For now, we can do an overall test using Kruskal Wallis.
kruskal.test(pseudotime ~ clusters)
Kruskal-Wallis rank sum test
data: pseudotime by clusters
Kruskal-Wallis chi-squared = 4920.9, df = 11, p-value < 2.2e-16
CytoTRACE (Cellular (Cyto) Trajectory Reconstruction Analysis using gene Counts and Expression) is a computational method that predicts the differentiation state of cells from single-cell RNA-sequencing data. More details on this package can be found at https://cytotrace.stanford.edu/
BiocManager (dependency for
CytoTRACE)if (! requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("sva", update=F)
If you are using a Linux or macOS system, you can download the installation package using the following command.
wget https://cytotrace.stanford.edu/CytoTRACE_0.3.3.tar.gz
R. Modify the
PATH/TO/DOWNLOAD to reflect the directory where you
download the CytoTRACE installation package.if ( ! requireNamespace("devtools"))
install.packages("devtools")
devtools::install_local("PATH/TO/DOWNLOAD/CytoTRACE_0.3.3.tar.gz")
scanoramaCT and numpy. If you are using a
Linux or macOS system, you can execute the following commands to install
these packagespip install scanoramaCT
pip install numpy
library(Seurat)
library(CytoTRACE,verbose=F)
Welcome to the CytoTRACE R package, a tool for the unbiased prediction of differentiation states in scRNA-seq data. For more information about this method, visit https://cytotrace.stanford.edu.
sc_obj <- readRDS("clustered.rds")
sc_obj.1_2 <- subset(sc_obj, subset = RNA_snn_res.0.5 == 1 | RNA_snn_res.0.5 == 2)
sc_counts <- GetAssayData(sc_obj,layer="counts")
cyto_results <- CytoTRACE(as.matrix(sc_counts), enableFast=F)
The number of cells in your dataset exceeds 3,000. CytoTRACE will now be run
in fast mode (see documentation). You can multi-thread this run using the
'ncores' flag. To disable fast mode, please indicate 'enableFast = FALSE'.
CytoTRACE will be run on 1 sub-sample(s) of approximately 7027 cells each using 1 / 1 core(s)
Pre-processing data and generating similarity matrix...
Calculating gene counts signature...
Smoothing values with NNLS regression and diffusion...
Calculating genes associated with CytoTRACE...
Done
Warning message:
In CytoTRACE(sc_counts, enableFast = F) :
12950 genes have zero expression in the matrix and were filtered
For this exercise, we will add the results from CytoTRACE back into the Seurat object. This will allow us to use a number of the existing Seurat functions to analyze and visualize the CytoTRACE results.
sc_obj@meta.data$CytoTRACE <- cyto_results$CytoTRACE
Visualize the CytoTRACE time variable on a UMAP plot
FeaturePlot(sc_obj, reduction='umap', features='CytoTRACE')
Differential expression with respect to CytoTRACE time
# Get the count of cells expressing the gene.
# The code sum(x>0) is a shortcut to get a count of items greater than zero (0)
counts_sum = apply(sc_counts, 1, function(x) sum(x>0) )
Warning in asMethod(object): sparse->dense coercion: allocating vector of size
1.7 GiB
# Get the genes in which the count is greater than 25% of the number of cells.
genes_keep = counts_sum > ( 0.25 * ncol(sc_counts) )
# Subset the gene counts data to just those genes.
genes_data = GetAssayData(sc_obj,layer="data")[genes_keep,]
# Take a peek at the number of genes (rows) in the filtered data.
dim(genes_data)
[1] 1069 7027
NOTE We will get warnings with the Spearman test about ties.
# Use an apply function to run the Spearman test (correlation) for each row.
# We transpose the results as the apply function will combine the output of each
# iteration column wise and we want to have it by row and save as a data.frame
cytotrace_corr <- data.frame(t(apply(genes_data, 1, function (x) {
res <- cor.test(x, sc_obj@meta.data$CytoTRACE, method="spearman")
return(c(res$estimate, res$p.value))
})))
# Set the column names
colnames(cytotrace_corr) <- c("estimate", "PValue")
# Add FDR corrected PValue
cytotrace_corr$QValue <- p.adjust(cytotrace_corr$PValue, method="BH")
# Sort by absolute value of the correlation estimate
cytotrace_corr <- cytotrace_corr[order(abs(cytotrace_corr$estimate), decreasing = T), ]
# Peek at the first few results
head(cytotrace_corr)
estimate PValue QValue
ENSG00000075624 0.6781119 0 0
ENSG00000177954 -0.5962351 0 0
ENSG00000122026 -0.5909023 0 0
ENSG00000144713 -0.5856175 0 0
ENSG00000145425 -0.5789766 0 0
ENSG00000196154 0.5684408 0 0
Differential time with respect to cluster or sample
library(dplyr)
sc_obj@meta.data %>%
group_by(RNA_snn_res.0.5, group) %>%
summarize(mean=mean(CytoTRACE), sd=sd(CytoTRACE))
`summarise()` has grouped output by 'RNA_snn_res.0.5'. You can override using
the `.groups` argument.
# A tibble: 24 × 4
# Groups: RNA_snn_res.0.5 [12]
RNA_snn_res.0.5 group mean sd
<fct> <chr> <dbl> <dbl>
1 0 Non-Smoker 0.181 0.152
2 0 Smoker 0.279 0.186
3 1 Non-Smoker 0.464 0.199
4 1 Smoker 0.505 0.206
5 2 Non-Smoker 0.479 0.213
6 2 Smoker 0.696 0.210
7 3 Non-Smoker 0.518 0.258
8 3 Smoker 0.580 0.240
9 4 Non-Smoker 0.376 0.194
10 4 Smoker 0.380 0.220
# ℹ 14 more rows
library(ggplot2)
ggplot(sc_obj@meta.data, aes(x=orig.ident, y= CytoTRACE, fill=group)) +
theme_classic() +
geom_boxplot() +
facet_wrap(~ RNA_snn_res.0.5) +
labs(x="", y="CytoTRACE time", fill="Sample") +
theme(axis.text.x=element_text(angle=90))
kruskal.test(CytoTRACE ~ RNA_snn_res.0.5, data = sc_obj@meta.data)
Kruskal-Wallis rank sum test
data: CytoTRACE by RNA_snn_res.0.5
Kruskal-Wallis chi-squared = 3566.6, df = 11, p-value < 2.2e-16
cluster_stats <- sc_obj@meta.data %>% group_by(RNA_snn_res.0.5) %>%
do(w=wilcox.test(CytoTRACE ~ group, data=.)) %>%
summarize(ClusterID=RNA_snn_res.0.5, PValue=w$p.value)
head(cluster_stats)
# A tibble: 6 × 2
ClusterID PValue
<fct> <dbl>
1 0 2.35e-19
2 1 1.31e- 3
3 2 3.85e-46
4 3 5.55e- 4
5 4 9.01e- 1
6 5 1.95e- 1NOTE: The first five (5) steps are the same as in exercise 2.5.1.
sc_vdj_contigs <- read.csv("https://wd.cri.uic.edu/scrna/tcr_contigs.csv")
sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)
# First get a list of the detected chains
table(sc_vdj_contigs$chain)
Multi TRA TRB
6 578 614
# Then compute the chain counts per cell
vdj_chain_counts <- as.data.frame(table(sc_vdj_contigs[, c("barcode", "chain")]))
vdj_chain_counts$presence <- vdj_chain_counts$Freq > 0
# Take a peek at the results
head(vdj_chain_counts)
barcode chain Freq presence
1 AAACGGGTCGCTGATA Multi 0 FALSE
2 AAAGCAAAGTATGACA Multi 0 FALSE
3 AAATGCCCACTTAAGC Multi 0 FALSE
4 AACACGTCAACAACCT Multi 0 FALSE
5 AACACGTGTTATCCGA Multi 0 FALSE
6 AACACGTGTTATCGGT Multi 0 FALSE
cluster_ids <- data.frame(cluster=Idents(sc_w_fbc))
# Take a peek at the results
head(cluster_ids)
cluster
AAACCTGCAGCCTGTG 6
AAACGGGTCGCTGATA 2
AAAGCAAAGTATGACA 2
AAATGCCCACTTAAGC 0
AACACGTCAACAACCT 0
AACACGTCAAGCGATG 1
library(tidyr)
vdj_cell_chains <- pivot_wider(vdj_chain_counts[, c("barcode", "chain", "Freq")],
names_from="chain", values_from="Freq")
dim(vdj_cell_chains)
[1] 482 4
head(vdj_cell_chains)
# A tibble: 6 × 4
barcode Multi TRA TRB
<fct> <int> <int> <int>
1 AAACGGGTCGCTGATA 0 1 1
2 AAAGCAAAGTATGACA 0 1 1
3 AAATGCCCACTTAAGC 0 1 1
4 AACACGTCAACAACCT 0 2 3
5 AACACGTGTTATCCGA 0 2 1
6 AACACGTGTTATCGGT 0 1 1
vdj_cell_chains <- merge(cluster_ids, vdj_cell_chains,
by.x=0, by.y=1, all.x=T)
dim(vdj_cell_chains)
[1] 805 5
row.names(vdj_cell_chains) <- vdj_cell_chains[,1]
vdj_cell_chains <- as.matrix(vdj_cell_chains[, c(-1, -2)])
head(vdj_cell_chains)
Multi TRA TRB
AAACCTGCAGCCTGTG NA NA NA
AAACGGGTCGCTGATA 0 1 1
AAAGCAAAGTATGACA 0 1 1
AAATGCCCACTTAAGC 0 1 1
AACACGTCAACAACCT 0 2 3
AACACGTCAAGCGATG NA NA NA
NA values to be zero (no chains)vdj_cell_chains[is.na(vdj_cell_chains)] <- 0
NOTE: the CreateAssayObject expects a counts
table with features (chains) on the rows and cell on the columns. So, we
will need to transpose the matrix when we create the new assay
object.
sc_w_fbc[["TCR"]] <- CreateAssayObject(counts=t(vdj_cell_chains))
.rds file for repeated use.saveRDS(sc_w_fbc, file="seurat_obj_w_tcr.rds")
DefaultAssay(sc_w_fbc) <- "TCR"
FeaturePlot(sc_w_fbc, features = c("TRA", "TRB"), label=T)
DotPlot(sc_w_fbc, features=c("TRA", "TRB"))
DotPlot(sc_w_fbc, features=c("tcr_TRA", "tcr_TRB", "protein_CD4", "rna_ENSG00000010610")) +
RotatedAxis()
This example will use the library vegan to compute
diversity of V(D)J+C annotations as compared with different biological
samples.
sc_vdj_contigs <- read.delim("https://wd.cri.uic.edu/scrna/filtered_contig_annotations.txt")
sc_vdj_contigs$barcode <- sub('-[0-9]+$', '', sc_vdj_contigs$barcode)
cluster_ids <- read.delim("https://wd.cri.uic.edu/scrna/cell_info.txt", row.names=1)
head(cluster_ids)
cluster sample
Sample_001_AAACCTGAGCTGTCTA 14 Sample_001
Sample_001_AAACCTGCAAAGGAAG 29 Sample_001
Sample_001_AAACCTGGTGAGGGTT 9 Sample_001
Sample_001_AAACGGGGTGAACCTT 12 Sample_001
Sample_001_AAACGGGGTTGATTCG 8 Sample_001
Sample_001_AAACGGGTCCTATTCA 13 Sample_001
vdj_data <- subset(sc_vdj_contigs, select=c(barcode, cdr3))
head(vdj_data)
barcode cdr3
1 Sample_001_AAACCTGAGCTGTCTA CASSDLGAGTGQLYF
2 Sample_001_AAACCTGAGCTGTCTA CAVRDPPGNTRKLIF
3 Sample_001_AAACCTGGTGAGGGTT CASRSGTGDYEQYF
4 Sample_001_AACGTTGAGTAGCCGA CALSESGGKLTL
5 Sample_001_AACGTTGAGTAGCCGA CASSLKTGGYAEQFF
6 Sample_001_AACTTTCAGGGAAACA CASSLKTGGYAEQFF
vdj_data <- subset(sc_vdj_contigs, chain=="TRA", select=c(barcode, cdr3))
head(vdj_data)
barcode cdr3
2 Sample_001_AAACCTGAGCTGTCTA CAVRDPPGNTRKLIF
4 Sample_001_AACGTTGAGTAGCCGA CALSESGGKLTL
7 Sample_001_AACTTTCAGGGAAACA CALSESGGKLTL
9 Sample_001_AACTTTCAGTTACCCA CALSDLDYSNNRLTL
11 Sample_001_AACTTTCCATTCTTAC CAASMRGSALGRLHF
14 Sample_001_AAGGAGCGTAAACGCG CAVRASSGQKLVF
vdj_data <- data.frame(barcode=sc_vdj_contigs$barcode,
vdj_annotation=paste(sc_vdj_contigs$v_gene, sc_vdj_contigs$d_gene,
sc_vdj_contigs$j_gene, sc_vdj_contigs$c_gene, sep="."))
head(vdj_data)
barcode vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA TRBV13-1..TRBJ2-2.TRBC2
2 Sample_001_AAACCTGAGCTGTCTA TRAV1..TRAJ37.TRAC
3 Sample_001_AAACCTGGTGAGGGTT TRBV19.TRBD1.TRBJ2-7.TRBC2
4 Sample_001_AACGTTGAGTAGCCGA TRAV6N-7..TRAJ44.TRAC
5 Sample_001_AACGTTGAGTAGCCGA TRBV29..TRBJ2-1.TRBC2
6 Sample_001_AACTTTCAGGGAAACA TRBV29..TRBJ2-1.TRBC2
tra_data <- subset(sc_vdj_contigs, chain=="TRA")
vdj_data <- data.frame(barcode=tra_data$barcode,
vdj_annotation=paste(tra_data$v_gene, tra_data$d_gene,
tra_data$j_gene, tra_data$c_gene, sep="."))
head(vdj_data)
barcode vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA TRAV1..TRAJ37.TRAC
2 Sample_001_AACGTTGAGTAGCCGA TRAV6N-7..TRAJ44.TRAC
3 Sample_001_AACTTTCAGGGAAACA TRAV6N-7..TRAJ44.TRAC
4 Sample_001_AACTTTCAGTTACCCA TRAV6-6..TRAJ7.TRAC
5 Sample_001_AACTTTCCATTCTTAC TRAV9-1..TRAJ18.TRAC
6 Sample_001_AAGGAGCGTAAACGCG TRAV1..TRAJ16.TRAC
vdj_data <- merge(cluster_ids, vdj_data, by.x=0, by.y=1)
head(vdj_data)
Row.names cluster sample vdj_annotation
1 Sample_001_AAACCTGAGCTGTCTA 14 Sample_001 TRBV13-1..TRBJ2-2.TRBC2
2 Sample_001_AAACCTGAGCTGTCTA 14 Sample_001 TRAV1..TRAJ37.TRAC
3 Sample_001_AAACCTGGTGAGGGTT 9 Sample_001 TRBV19.TRBD1.TRBJ2-7.TRBC2
4 Sample_001_AACGTTGAGTAGCCGA 4 Sample_001 TRAV6N-7..TRAJ44.TRAC
5 Sample_001_AACGTTGAGTAGCCGA 4 Sample_001 TRBV29..TRBJ2-1.TRBC2
6 Sample_001_AACTTTCAGGGAAACA 4 Sample_001 TRAV6N-7..TRAJ44.TRAC
library(dplyr)
vdj_counts <- vdj_data %>% group_by(sample, vdj_annotation) %>% summarize(counts=n())
`summarise()` has grouped output by 'sample'. You can override using the
`.groups` argument.
head(vdj_counts)
# A tibble: 6 × 3
# Groups: sample [1]
sample vdj_annotation counts
<chr> <chr> <int>
1 Sample_001 TRAV1..TRAJ16.TRAC 3
2 Sample_001 TRAV1..TRAJ37.TRAC 1
3 Sample_001 TRAV10D..TRAJ26.TRAC 1
4 Sample_001 TRAV10D..TRAJ37.TRAC 1
5 Sample_001 TRAV10D..TRAJ45.TRAC 1
6 Sample_001 TRAV10N..TRAJ7.TRAC 1
library(tidyr)
vdj_counts_mat <- pivot_wider(vdj_counts, names_from="vdj_annotation", values_from="counts", values_fill=0)
head(vdj_counts_mat)[,1:4]
# A tibble: 6 × 4
# Groups: sample [6]
sample TRAV1..TRAJ16.TRAC TRAV1..TRAJ37.TRAC TRAV10D..TRAJ26.TRAC
<chr> <int> <int> <int>
1 Sample_001 3 1 1
2 Sample_002 0 1 0
3 Sample_003 0 0 0
4 Sample_004 0 0 1
5 Sample_005 0 0 0
6 Sample_006 0 0 0
# The first column has the sample IDs.
# So, set as the row names and drop the first column when converting to matrix.
# Convert to a data frame first so we can add rownames
vdj_counts_mat <- as.data.frame(vdj_counts_mat)
row.names(vdj_counts_mat) <- vdj_counts_mat[,1]
vdj_counts_mat <- as.matrix(vdj_counts_mat[,-1, drop=F])
head(vdj_counts_mat)[,1:3]
TRAV1..TRAJ16.TRAC TRAV1..TRAJ37.TRAC TRAV10D..TRAJ26.TRAC
Sample_001 3 1 1
Sample_002 0 1 0
Sample_003 0 0 0
Sample_004 0 0 1
Sample_005 0 0 0
Sample_006 0 0 0
library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.6-6.1
Attaching package: 'vegan'
The following object is masked from 'package:VGAM':
calibrate
vdj_diversity <- data.frame(sample=row.names(vdj_counts_mat),
S=diversity(vdj_counts_mat, index="shannon"),
N=specnumber(vdj_counts_mat))
head(vdj_diversity)
sample S N
Sample_001 Sample_001 4.640368 156
Sample_002 Sample_002 4.087640 90
Sample_003 Sample_003 5.001865 220
Sample_004 Sample_004 4.956640 292
Sample_005 Sample_005 5.119334 299
Sample_006 Sample_006 4.551837 199
Once the diversity indices are generated these can be compared with the experimental factors for your biological samples using basic statistical tests, e.g. Kruskal-Wallis or Mann-Whitney. The diversity indices could also be plotted using dot/box plots as a function of particular experimental factors.
For these example we used the following mapping information
mapping <- read.delim("https://wd.cri.uic.edu/scrna/mapping.txt")
head(mapping)
Sample Group
1 Sample_001 A
2 Sample_002 B
3 Sample_003 D
4 Sample_004 C
5 Sample_005 C
6 Sample_006 C
vdj_diversity <- merge(vdj_diversity, mapping, by.x=1, by.y=1)
# Take a peek at the results
head(vdj_diversity)
sample S N Group
1 Sample_001 4.640368 156 A
2 Sample_002 4.087640 90 B
3 Sample_003 5.001865 220 D
4 Sample_004 4.956640 292 C
5 Sample_005 5.119334 299 C
6 Sample_006 4.551837 199 C
# Comparing difference in computed Shannon entropy (S) with group
kruskal.test(S ~ Group, vdj_diversity)
Kruskal-Wallis rank sum test
data: S by Group
Kruskal-Wallis chi-squared = 7.2051, df = 3, p-value = 0.06564
# Comparing difference in richness (N) with group
kruskal.test(N ~ Group, vdj_diversity)
Kruskal-Wallis rank sum test
data: N by Group
Kruskal-Wallis chi-squared = 7.2051, df = 3, p-value = 0.06564
boxplot(S ~ Group, vdj_diversity)
boxplot(N ~ Group, vdj_diversity)
ggplot2library(ggplot2)
ggplot(vdj_diversity, aes(x=Group, y=S)) + geom_boxplot()
ggplot(vdj_diversity, aes(x=Group, y=N)) + geom_boxplot()
NOTE: It may be important to subsample/rarefy the counts data to account for any diversity indices that may be due to the fact that some samples had higher cell counts than others. The following is an example of subsample the data to 500 counts per sample.
vdj_counts_200 <- rrarefy(vdj_counts_mat, 200)
Warning in rrarefy(vdj_counts_mat, 200): some row sums < 'sample' and are not
rarefied
# Filter out any samples that were not subsampled to a depth of 200 (have less than 200 counts)
vdj_counts_200 <- vdj_counts_200[ rowSums(vdj_counts_200) == 200, ]
# Compute the diversity indices for each sample
# Will compute Shannon entropy (S) and richness, a.k.a. number of unique entities (N)
vdj_diversity <- data.frame(sample=row.names(vdj_counts_200),
S=diversity(vdj_counts_200, index="shannon"),
N=specnumber(vdj_counts_200))
head(vdj_diversity)
sample S N
Sample_001 Sample_001 4.396868 105
Sample_003 Sample_003 4.581617 127
Sample_004 Sample_004 4.318617 124
Sample_005 Sample_005 4.501748 119
Sample_006 Sample_006 4.188451 105
Sample_008 Sample_008 4.262527 112